Cellular proliferation biases clonal lineage tracing and trajectory inference

Abstract Motivation Lineage tracing and trajectory inference from single-cell RNA-sequencing data hold tremendous potential for uncovering the genetic programs driving development and disease. Single cell datasets are thought to provide an unbiased view on the diverse cellular architecture of tissues. Sampling bias, however, can skew single cell datasets away from the cellular composition they are meant to represent. Results We demonstrate a novel form of sampling bias, caused by a statistical phenomenon related to repeated sampling from a growing, heterogeneous population. Relative growth rates of cells influence the probability that they will be sampled in clones observed across multiple time points. We support our probabilistic derivations with a simulation study and an analysis of a real time-course of T-cell development. We find that this bias can impact fate probability predictions, and we explore how to develop trajectory inference methods which are robust to this bias. Availability and implementation Source code for the simulated datasets and to create the figures in this manuscript is freely available in python at https://github.com/rbonhamcarter/simulate-clones. A python implementation of the extension of the LineageOT method is freely available at https://github.com/rbonhamcarter/LineageOT/tree/multi-time-clones.

In this work, we identify a fundamental statistical bias that emerges from sampling cell lineage barcodes across a time course.We consider the setting in which cell state and lineage barcodes may be measured simultaneously (i.e. for the same cells) (Raj et al. 2018, Spanjaard et al. 2018), and copies of the same lineage barcodes may be observed over multiple time points (e.g. by repeated sampling from the same population).While these data have been collected and analyzed for some time (Biddy et al. 2018, Hurley et al. 2020, Weinreb et al. 2020), this bias has not yet been reported.
We focus on the case of static lineage barcodes, which are incorporated into the genome at an early stage of the developmental process, and then inherited by daughter cells but not modified over time.Static lineage barcodes allow the tracing of clones: groups of cells that share a common barcode and therefore all share a common ancestor at the time of barcoding.If a clone is observed at only a single time point we will refer to it as a single-time clone, otherwise we will refer to it as a multi-time clone.The distinction between multitime and single-time clones is important for understanding the source of the bias we discuss in this paper, and codifies clones by the level of information that they provide on developmental trajectories.Multi-time clones are particularly valuable for trajectory inference because they contain direct observations of developmental transitions.Indeed, recent methods including CoSpar (Wang et al. 2022) are designed to leverage multi-time clonal barcodes to infer cell fates.
We present a mathematical analysis that proves that the relative abundance of subpopulations is changed, or biased, in multitime clonal datasets.The source of the bias is heterogeneous growth rates; cells with more descendants are more likely to be represented in multi-time clones.Therefore, more proliferative subpopulations are over-represented in multi-time clonal datasets.We prove the existence of this effect by simple analysis of probabilities, validating our arguments with proportions obtained from simulations as well as real data.We further show that the performance of trajectory inference methods such as CoSpar, which rely on this biased information, may be negatively impacted by the presence of this sampling bias.
Knowledge of this statistical phenomenon can help guide the design of robust trajectory inference methods.Beyond considering the estimates from CoSpar in the presence of this bias, we also considered the LineageOT trajectory inference method (Forrow and Schiebinger 2021).Since LineageOT uses only single-time clone information, we first introduce an extension of LineageOT, which we will refer to as LineageOT-MT, which additionally incorporates information from multi-time clonal barcodes.We test the extension on a barcoded time course of T-cell induction (see separate paper by Michaels et al. 2023), and also simulated datasets for which the ground-truth is accessible.We find that this extension offers a robust improvement over the original LineageOT at sufficiently high sampling rates, which validates the value of multi-time clonal information.Further, we find LineageOT-MT may be negatively impacted by the presence of the sampling bias.Given the potential for the application of trajectory inference results to biomedical technologies and treatments, understanding and improving the accuracy of these methods is of critical importance.

Materials and methods
The remainder of the paper is organized as follows.In Section 2.1, we present a mathematical derivation of the sampling bias.In Section 2.2, we demonstrate that the performance of trajectory inference methods may be negatively impacted by analyzing predictions from CoSpar in the presence of this bias.In Section 2.3, we introduce LineageOT-MT, demonstrate improved performance over the original LineageOT, and analyze the predictions from LineageOT-MT in the presence of the sampling bias.We conclude in Section 3 with a Discussion, where we explore how state-of-the-art methods for lineage-informed trajectory inference may be developed to overcome the impact of this sampling bias.

Simulations for CoSpar analysis
Two variations, baseline and effect, of the population were simulated to separate the effect of the bias in the multi-time clones from the effect of the sampling rate.In the baseline, all growth rates are equal resulting in no bias effect.In the effect case, the two cell types at t 1 have different growth rate distributions, resulting in a bias in the t 1 cell type proportions in the multitime clones.Every cell of the first type has a growth rate of eight descendants, while 20% of the cells (selected uniformly at random) of the second type have a growth rate of eight descendants and 80% have a growth rate of two descendants.The heterogeneous growth rate distribution of the second cell type was used rather than all cells of this type having two descendants as this produced a more pronounced difference between the baseline and effect results for CoSpar-MT.The cells of the second type with eight descendants, rather than two, could represent cells with a distinct chromatin accessibility profile (between t 1 and t 2 ) to the rest of the cluster.
To control for the effect of the number of cells sampled and manage the computational resources required for trajectory inference, 1000 cells were sampled at t 1 and t 2 for every sampling rate.Sampling a fixed number of cells rather than sampling at a fixed rate also more realistically models experimental procedures, in which only a fixed budget of cells can be sequenced.Each specified t 1 sampling rate, r t 1 , is obtained by simulating a population of N t 1 cells where N t 1 ¼ d1000=r t 1 e.For the effect and baseline cases each, the results were generated from 30 simulated populations (one for each t 1 sampling rate tested), each containing between 2500 and 60 000 cells with population size scaling inversely to sampling rate.The true trajectory coupling was computed for each population, and then 1000 cells were sampled at each t 1 and t 2 .Next, CoSpar-MT was used to estimate the trajectory coupling between the t 1 and t 2 samples.

Simulations for LineageOT analysis
The simulation scenario for the LineageOT analysis is similar to the scenario for the CoSpar analysis.The main difference between the two scenarios is the distribution of the growth rates for the cells at t 1 in the effect case.In this scenario, they are sampled from a one dimensional linear gradient distribution over cell state space (maximum growth rate of 11.0, minimum growth rate of 0.008).Due to the position of the cell type clusters at t 1 , the mean growth of one type is 8.8 while the other is 1.6.This simulation scenario was selected over the scenario for the CoSpar analysis as it produced a more pronounced difference between the baseline and effect results for LineageOT-MT.

Cell growth introduces bias in multi-time clonal barcodes
In this section, we present two derivations of the statistical sampling bias in multi-time clone data.We begin by analyzing the sampling bias in a general population (Equation ( 1)), and then we examine our results in the setting of discrete cell types (Equation ( 2)).

General derivation of sampling bias
Consider a barcoded population of cells, where each cell is given a unique barcode at some initial time t 0 , and these barcodes are inherited by daughter cells with each cell division.After barcoding, the population develops over time and is then sampled at times t 1 and t 2 .Each sample is sequenced via scRNA-seq to determine the state (gene expression) of each cell as well as, if it is detected, the lineage barcode.
We are interested in how cells with different growth rates are represented in multi-time clones, where the multi-time clones are constructed from uniformly sampling at t 1 and t 2 as illustrated by the example in Fig. 1.Note that while we analyze a single pair of time points, the bias we discover may be greater for datasets with more than two time points (see Supplementary data-Appendix A2).
More precisely, for this derivation we will consider the set of cells in multi-time clones, denoted MT, which results from the sampling process as follows.Assume we have a population of N t 1 cells at t 1 and N t 2 cells at t 2 .Each cell x at t 1 gives rise to g(x) descendants at t 2 , as in the example shown in Fig. 1a.Denote the set of cells in the population at t 1 and t 2 by Mðt 1 Þ and Mðt 2 Þ, respectively.Using this notation, the sampling process may be described as: 1) Sample cells uniformly at t 1 and t 2 , as illustrated in Fig. 1b.Denote the sets of cells sampled by Remove cells for which a lineage barcode was not detected from each sample.Denote the sets of remaining cells in each sample as Remove cells in clones observed at only one time point (single-time clones), as in Fig. 1c.We denote the remaining cells by MT � ðBðt 1 Þ [ Bðt 2 ÞÞ.These are the cells in multi-time clones.
We now analyze the proportions of cell types in MT and find that these are different from the true abundances in the population at time t 1 .Similar derivations also show a bias in the abundances at t 2 (see Supplementary data-Appendix A1), and, for time series with more than two time points, a bias in the abundances for each sample time (see Supplementary data-Appendix A2).
We analyze the proportions of cells at time t 1 in MT by calculating the probability that a cell from the original population will be retained in MT.Denote the sampling rate (the proportion of cells sampled) by r ti 2 ð0; 1Þ for i ¼ 1, 2. Let b 2 ½0; 1� denote the probability that a lineage barcode is detected in a cell.We will refer to b as the barcode rate; it represents a combined rate of barcode insertion and detection.Denote the net growth between t 1 and t 2 of a single-cell x at t 1 by g(x).This means that g(x) is the number of descendants of cell x at time t 2 , where g(x) ¼ 0 if x dies between t 1 and t 2 .We will show that restricting to multi-time clones can bias the proportion of cells in a way that depends on the growth g(x) of the cells in the clone, sampling rates r t 1 and r t 2 , and barcode rate b.
We now derive an expression for the probability that a cell x at time t 1 will be in MT.To simplify the expression of the probabilities, we assume that the two samples are independent and that we are sampling with replacement.Let N ti denote the number of cells in the population at time Let gðcÞ be the sum of the growth rates of the cells at t 1 in clone c, i.e. gðcÞ ¼ P x : CðxÞ¼c gðxÞ.Finally, for any cell x, we denote the set of cells with the same barcode as CðxÞ ¼ fy : CðxÞ ¼ CðyÞg.Then, for any cell x at time t 1 , the probability that x is retained in MT is where The first line follows from our independent samples assumption, and the last line follows from approximating by sampling with replacement.Simplifying, we obtain Equation ( 1) shows how the probability that a cell at t 1 will be in a sampled multi-time clone depends on the growth rate of its clone relative to the other clones in the population at t 1 .

Sampling bias at the level of cell-types
We now analyze the sampling bias at the resolution of cell types.Consider the simplified setting in which: 1) Each clone contains cells of only one cell type at t 1 2) The growth rate is homogeneous across each cell type, i.e. for all cells x, y at t 1 of type l, gðxÞ ¼ gðyÞ.Denote this g(l).3) All clones of type l start with the same number of cells at t 1 , denoted m(l).
Under these simplifying assumptions, we have gðCðxÞÞ ¼ mðlÞgðlÞ for every cell x of type l.Then the bias in proportions

Cellular proliferation
of cell types at t 1 within MT is revealed by the following conditional probability, where the sum is over all possible cell types l.It is clear from Equation ( 2) that the proportions of the cell types at t 1 in MT may significantly differ from the true proportions.For example, consider a population with two cell types A and B where Suppose that the proportions of type A and B at t 1 are equal, and at each time point we sample 50% of the population and detect a lineage barcode in every cell that we sample.Then, by Equation ( 2), the expected proportions of the two cell types A and B at t 1 in MT are 2/5 and 3/5, respectively, in contrast to the 50:50 ratio of A:B that is present in the total sample at t 1 , as well as in the cells observed in single-time clones at t 1 .These proportions will depend on the growth rates, g(l) and m(l), of each cell type, the rates r t 1 and r t 2 of cell capture, and the barcoding efficiency b.
The derivations in Supplementary data-Appendix A1 show that the cell type proportions at t 2 in MT are similarly biased.However, at t 2 the probabilities depend on growth rates between t 0 and t 1 rather than between t 0 and t 2 as in Equation ( 2).In this work, we focused our attention on the impact of the bias in the t 1 proportions as a representative example of the effect.
To support our probabilistic reasoning, we conduct a simulation study to validate Equation (2).See Fig. 2 for the results, where we plot the true simulated proportions and the proportions we predict using Equation (2).The plot shows the cell type proportions in MT for a population with two cell types, denoted A and B, with N t 1 ¼ 2000, m(A) ¼ 2 and m(B) ¼ 4, g(A) ¼ 1 and g(B) ¼ 2, and varying sample rate r 2 ½2; 100%�.In this example, the true proportions of type A and B at t 1 are equal.The plot clearly shows that the simulated results concentrate around the predictions and that the bias in the proportions increases at a superlinear rate as sampling rate r decreases, as expected from Equation (2).As sampling at each time is without replacement in these simulations, this provides evidence that approximating by sampling with replacement in the derivation of Equations ( 1) and ( 2) is reasonable for estimating the t 1 cell type proportions in MT.The plot demonstrates that as sampling and barcode detection rates decrease, the bias in the cell type proportions in MT may diverge significantly from their true values.
Our investigations into this bias effect were motivated by a time course of statically barcoded blood cell development published in a separate paper (Michaels et al. 2023).We include here a summary (Fig. 3) of the sampled cell type proportions in different subsets of clones: all (i.e. cells where a barcode was detected), multi-cell, and multi-time clones.Multi-cell clones are those that contain more than a single cell.
From our derivations, we do not expect biased cell type proportions in the set of cells where a barcode was detected or those in a multi-cell clone.Indeed, in Fig. 3 the difference in proportions between the total sample and these two subsets is within what one would expect from unbiased random sampling.In contrast, the difference between the proportions in the total sample and those in multi-time clones is outside the expected variance of unbiased sampling.The bias we have identified may explain this difference and be driven by a higher relative growth rate of progenitor cells.Conclusively determining that this is the cause would require growth and death rate estimates that were not available for this dataset.This rate information is needed to eliminate the alternative explanation that the lineages of cell types with decreased proportions (e.g.myeloid at t ¼ 5) died out by t ¼ 10 at a significantly greater frequency than those of increased proportions (e.g.progenitor at t ¼ 5).
We conclude with four summary remarks related to this statistical phenomenon: (1) single-time lineage information exhibits no such bias, (2) there will be no bias when growth rates are equal across the population for all time, (3) the bias can present for clones sampled at just two time points, but it may increase with sampling at additional times (see Supplementary data-Appendix A2) and, (4) the cell type proportions in MT at each sample time may be biased depending on the growth rates in a similar manner as in Equation ( 2).These remarks and the results of this section apply to the proportions of any subdivision/clustering of cells into subpopulations that correlate with growth rates, cell types being one prominent example.

Bias in multi-time clonal barcodes can affect trajectory inference
Given this newly discovered sampling bias in multi-time clonal barcodes, a natural question is whether this bias impacts the current state-of-the-art methods which use this information for trajectory inference.
We investigate the answer to this question for CoSpar, a method that has shown promising results for datasets representing hematopoiesis, reprogramming and directed differentiation (Wang et al. 2022).CoSpar (coherent, sparse optimization), is designed around the assumption that trajectory couplings are sparse and "locally coherent," i.e. cells with similar states have similar fates.These two assumptions are built into the CoSpar method via a thresholding and "smoothing" step, respectively.Smoothing refers to performing linear diffusion on a weighted graph defined by the cell state "similarity."Similarity is defined by the transition probability of a random walk on a k-nearest neighbor graph (kNN) constructed from a principal component analysis embedding of the data (the same kNN used in the UMAP method).The smoothing and thresholding steps are applied iteratively to a coupling containing only intra-clone transitions until convergence (Wang et al. 2022).CoSpar features two different methods for lineage-informed trajectory inference.Both methods use cell state information, but one uses only single-time clonal data while the other uses only multi-time clonal data.We will refer to the latter as CoSpar-MT.
We focus on the impact of the bias on the inferred fate probabilities of the cells, a key prediction derived from trajectory couplings.Our results demonstrate that it is possible for the bias in the multi-time clone data to negatively impact the performance of CoSpar-MT on the task of fate probability prediction (see Fig. 4).The results were gathered for a simple simulated population (described in Materials and methods) with two cell types at t 1 and t 2 for which the ground truth fate probabilities are known.
The results in Fig. 4 show that the mean correlation decreases as the sampling rate decreases, and furthermore, once the bias effect is sufficiently large (around r t 1 b ¼ 40%) the mean correlation is lower in the effect case than the baseline case, with a difference of approximately 0.1.Since the error bars of the baseline and effect results overlap, we do not claim that correlation is lower in the effect case with high probability, but the trend of the mean is clear.This demonstrates that it is possible for the bias in multi-time clones to negatively impact fate probability predictions from state-ofthe-art methods.

LineageOT-MT: extension of LineageOT to use multi-time clonal data
LineageOT (Forrow and Schiebinger 2021) is an extension of the state-based WaddingtonOT framework (Schiebinger et al. 2019) to incorporate lineage information from single-time lineage trees (trees where all observed cells are from a single time point).WaddingtonOT estimates trajectory couplings as the solution to an entropically regularized optimal transport (EOT) problem between the cell state distributions at each neighboring time point, with the distribution at the earlier time weighted by the estimated number of descendants of each cell (Schiebinger et al. 2019).LineageOT uses the same EOT estimation framework but estimates a coupling by the solution to an EOT problem between the cell state distribution at the later time point and the estimated ancestor cell state distribution at the earlier time point.The ancestor cell states are estimated using a Gaussian graphical model of the cell states and conditioning on the states observed at the later time point.The graph is constructed by connecting cells in the same clone to a common ancestor at the time of barcoding (Forrow and Schiebinger 2021).The current LineageOT method does not make use of valuable lineage information encoded in multi-time clonal barcodes, as is available in static lineage barcoding with population subsampling across time points.This limitation motivated our extension of LineageOT to use both multi-time and single-time clonal barcode information.We refer to this extension as LineageOT-MT (The implementation of LineageOT-MT is publicly available on a fork of the LineageOT package at: https://github.com/rbonhamcarter/LineageOT/tree/multitime-clones.).Our approach in LineageOT-MT differs from the approach in CoSpar where either multi-time (as in CoSpar-MT) or single-time clonal barcodes (as in CoSpar-ST) are used but not both.
The extension requires no new mathematical theory and only a relatively small change in the implementation of the algorithms.We simply condition on the state of all observed cells in each clone in the ancestor estimation step in The plot shows that the mean correlation decreases as the sampling rate decreases and that once the bias effect is sufficiently large (around r t1 b ¼ 40%) the mean correlation is lower in the effect case than the baseline case with a difference of approximately 0.1.This demonstrates that it is possible for the bias in multi-time clones to negatively impact fate probability prediction.The results for fate B are identical and hence are omitted.

Cellular proliferation
LineageOT, rather than just those at the later time point.The Gaussian estimation now benefits from using the observed states of ancestral and present-time clonal relatives in addition to the descendant relatives.
Trajectory inference results from simulated datasets show that multi-time clonal information improves the performance of LineageOT.Figure 5 demonstrates the improvement possible for the task of fate probability prediction.The results from this figure describe prediction performance on a simulation scenario (described in Materials and methods) similar to those used in the analysis of CoSpar in Section 2.2.Additional results (see Supplementary data-Appendix B1) were gathered for the simulation described in Section 2.2, using over 1200 simulated datasets in total (ten replicates per sample, 30 t 1 sampling rates, at least 2 growth rate variations of the 2 simulation scenarios).LineageOT-MT achieved an equal or higher mean correlation with the true fate probability than LineageOT on all of these datasets.For further reference, the fate probability prediction results for LineageOT and CoSpar-ST on both simulation scenarios are included in Supplementary data-Appendix B2.This simulation scenario was selected for inclusion in this section as it produced a more pronounced difference between the baseline and effect results for LineageOT-MT in our subsequent analysis (Section 2.3.1)than the simulation scenario in Section 2.2.
The results in Fig. 5 show that at the highest rate tested of 80% our extension of the method increases correlation with the true fate probability from approximately 0.6 to over 0.9.The performance of LineageOT-MT over LineageOT was assessed on simulated datasets across a range of sampling rates, as shown in Fig. 5, to illustrate the effect of decreasing the available multi-time clonal barcode information on prediction performance.As the sampling rate decreases, the decline in performance of LineageOT-MT is approximately linear and converges toward the performance of LineageOT.This is expected since the amount of multi-time clonal information is decreasing to zero with the sampling rate.

LineageOT-MT
In this section, we explore the impact of multi-time clonal barcode bias on LineageOT-MT.This exploration is analogous to the study of the effect of the bias on CoSpar-MT in Section 2.2.
The results were gathered using baseline and effect variations (as defined in Section 2.2) of the simulation scenario that was used to benchmark the performance of LineageOT-MT above.The performance of LineageOT-MT over decreasing sample rates for these two variations is shown in Fig. 6.This figure demonstrates that the mean correlation decreases as the sampling rate decreases.Furthermore, for a relatively high range of sampling rates (from r t 1 b ¼ 0.8 to around r t 1 b ¼ 0.3), the mean correlation is lower in the effect case than in the baseline case, with a difference of approximately 0.1.These results demonstrate that it is possible for the bias in multi-time clones to negatively impact fate probability predictions from LineageOT-MT.
The lack of impact on LineageOT-MT at rates less than 0.3, where the size of the bias effect is larger, may be explained by the methodological "convergence" of LineageOT-MT to LineageOT as the sampling rate decreases.This convergence corresponds to a shift from using multi-time and single-time lineage information to only using single-time lineage information.The number of cells observed in multi-time clones decreases with the sampling rate at a faster rate than the number in single-time clones, resulting in this shift occurring before the sampling rate nears zero.Recalling that the single-time information is not biased, this convergence may counter the impact of the bias on cell fate predictions from LineageOT-MT.

Discussion
In this work, we consider the problem of developmental trajectory inference using lineage information derived from multi-time clonal barcodes.We have shown that in certain experimental settings, there may exist a bias in the proportions of cell types represented in the multi-time clonal barcode data.This sampling bias emerges for clones sampled at just two time points and may increase with sampling at  The plot shows that the mean correlation decreases as the sampling rate decreases, and that for sufficiently high sampling rate (greater than approximately r t1 b ¼ 30%) the mean correlation is lower in the effect case than the baseline case, with a difference of approximately 0.1.This demonstrates that it is possible for the bias in multi-time clones to negatively impact fate probability prediction.The results for fate B are identical and hence are omitted.The lack of impact on LineageOT-MT at rates less than 0.3, where the size of the bias effect is larger, may be explained by the methodological "convergence" of LineageOT-MT to LineageOT as the sampling rate decreases.
additional times.The experimental settings where this bias emerges are those with heterogeneous growth rates and a combined sampling and barcode detection rate of less than 100% for at least one sample in the time course.These are of course very general conditions, and we suspect this bias to be present in most multi-time clonal barcode datasets.
In addition, we showed that it is possible for this bias to negatively impact predictions made from inferred developmental cell trajectories.This was demonstrated on a state-of-the-art trajectory inference method, CoSpar (Wang et al. 2022), using simulated datasets for which the ground truth is known.Further evidence was provided by introducing an extension (LineageOT-MT) of the LineageOT trajectory inference method (Forrow and Schiebinger 2021) to use multi-time clonal barcode information and similarly demonstrating the possible negative impact of the presence of the bias.We note that there may be more sophisticated ways to integrate the multi-time clone data into the LineageOT method beyond our present approach that could yield further performance improvements and robustness.
Our studies on the impact of the sampling bias on fate probability prediction with CoSpar-MT and LineageOT-MT show that each method is sensitive to the bias for distinct scenarios.Certain features of each method mediate the robustness or sensitivity to the bias effect.Qualitatively, the smoothing step in CoSpar-MT (which is strongly dependent on the state space geometry of the sampled cells) seems to improve the robustness of the method to the bias effect in some settings.For LineageOT-MT, using cell state and single-time clonal barcode information in addition to multi-time appears to provide robustness to the bias effect.This conclusion is supported by these first two data sources being unbiased and the prediction accuracy of the method being unaffected in the low sampling rate regime.
Purely considering the trends observed with respect to sampling rate, LineageOT-MT may offer robustness to the bias effect in the lower sampling rate regime while CoSpar may offer robustness at higher rates.However, as our results showed that the performance of LineageOT-MT nearly converges to that of LineageOT at lower sampling rates, the benefit of this LineageOT extension is not clear in the low sampling rate regime where most real world experiments lie.Further work is needed to improve the integration of multitime clonal information into LineageOT to increase its performance at lower sampling rates.
Multi-time clonal data, even if known to be biased by this sampling effect, is too valuable not to leverage in the study of single-cell development.This point is evidenced in this paper by the robust improvement of LineageOT-MT over the original LineageOT.What is needed is a solution for correcting the bias.Our work suggests that the most promising avenue is to leverage the unbiased single-time clonal barcode and cell state data.
We observe that the smoothing step of the CoSpar algorithm could be considered a kind of label propagation or "pseudo-labeling" (Kim et al. 2020, He et al. 2021, Taherkhani et al. 2021) that leverages the unbiased cell state data.The smoothing effectively propagates the multi-time clone label information from cells to nearby cells (in state space) that were not observed in a multi-time clone.See Fig. 7 for an illustration of the effect of label propagation on a dataset.Label propagation is one form of semi-supervised learning that can be applied to barcoded single-cell datasets.Semi-supervised learning refers to inference in settings where only a subset of the training data is labeled.This is the case in our application, where some cells are labeled (by their clone) and some are unlabeled as a lineage barcode was not detected.For any label propagation approach, it will be important to account for growth rates in the propagation to generate labels that correct the bias in the multi-time clones.
Motivated by the concept of pseudo-labeling and the unbiased nature of single-time clonal data, one idea to improve the robustness of CoSpar-MT to the bias is to correct the multi-time clonal matrix in the algorithm (Wang et al. 2022) using iterative proportional fitting (IPF).IPF is also known as Sinkhorn's algorithm in the optimal transport literature.It is a procedure for fitting data to a set of marginals which are typically known to be more accurate than the those of the data itself.In the case of CoSpar-MT, the two marginals of the multi-time clonal matrix correspond to the proportion of clones present in each subpopulation at t 1 and t 2 .The idea is to use IPF to project the matrix to the subspace of matrices with the "correct" marginals, i.e. those given by the unbiased single-time clonal data.Further investigation is required to determine whether this correction would indeed eliminate the impact of the bias in any predictions made using CoSpar-MT.
The equations in Section 2.1, and studies on CoSpar-MT and LineageOT-MT in the subsequent sections, provide some indication of when the bias effect will impact trajectory inference.In the future, it would be useful to have a statistical test that quantifies the expected impact of the bias on trajectory inference.A practical statistical test should use only information available in standard datasets of lineage barcoded sequencing time courses, which importantly may not include precise estimates of cell growth rates.Another important step is the definitive identification of the bias effect and its impact on trajectory inference methods for a real dataset.We leave such developments to future work, having focused in this work on the mathematical details and validations only possible using simulated data.
We hope that our identification of this statistical phenomenon serves to inform practitioners using multi-time clonal barcodes of possible biases in the data, and guide future method development.Beyond their use in proving the existence of the bias, the probabilities derived in Section 2.1 can also be used as a tool by practitioners for approximating the size of the bias effect in a dataset.Our contributions beyond these points are our extension to LineageOT, and the above discussion of how methods might be developed to overcome the impact of this sampling bias.Such development is an important endeavor to allow practitioners to make use of the valuable information in multi-time clonal barcodes, without these concerns of biased results and conclusions.

Availability of data and materials
The simulated data will be shared upon reasonable request to the corresponding author or may be recreated following the instructions at https://github.com/rbonhamcarter/simulateclones.The cell type proportions data visualized in Figure 3 were provided Dr Yale S. Michaels by permission, and will be shared on reasonable request to the corresponding author with permission of Dr Yale S. Michaels.

Figure 1 .
Figure 1.Example demonstrating the sampling process for obtaining the set of cells in multi-time clones: an illustration of the sampling process for a simple population with two cell types, A and B, where A has growth 1 between t 1 and t 2 while B has growth 2. The squares indicate cells sampled in step 1, and the re-coloured lineages in (c) indicate the cells/lineages removed in step 2 (the two rightmost lineages of Type A, and rightmost lineage of Type B).The set of cells remaining in (c) are those in multi-time clones, denoted MT.

Figure 2 .
Figure 2. Restricting to multi-time clones introduces bias in cell type proportions.Simulated and predicted proportions of cell types A and B at t 1 are shown as a function of the barcode sampling rate (combined sampling and barcode rate).The predicted proportions are computed from Equation (2) with N t1 ¼ 2000 cells, m(A) ¼ 2, m(B) ¼ 4, g(A) ¼ 1 and g(B) ¼ 2. In this example, the true proportions of type A and B at t 1 are equal.The bias in the proportions increases as sampling rate r decreases, where r ¼ r t1 b ¼ r t2 b is plotted for 50 values between 100% and 2%.Two-hundred replicates of the sampling process were generated for each of the 50 values.

Figure 3 .
Figure 3. Sampled cell type proportions in a time course of statically barcoded blood cell development differ when restricting to the subset of cells in multi-time clones.The proportions of different cell types observed at two time points (t ¼ 5 and t ¼ 10) in different subsets of clones (barcode detected, multi-cell, and multi-time) are shown as colored areas in bar plots.The proportions at each time for the total sample, barcode detected and multi-cell clones are similar (variance within that expected from random sampling), whereas the cell type proportions in the multi-time clones are significantly different at both time points.In particular, at t ¼ 5, the proportion of progenitor cells is approximately 0.9 instead of being in the range 0.5-0.7,and the myeloid proportion is approximately 0.1 instead of being in the range 0.3-0.4.The difference for the proportions at t ¼ 10 is smaller, but the proportions in the multi-time clones are still the furthest from those of the total sample.

Figure 4 .
Figure 4.It is possible for the bias to impact fate probability predictions from CoSpar-MT: Pearson correlation coefficient between the true fate probability and the fate probability estimated from CoSpar-MT (the probability that each cell is fated to be type A at t 2 ) as a function of t 1 rate (combined sampling and barcode rate r t1 b) for b ¼ 0.8 fixed for the baseline and effect cases.Values are summarized as a mean across ten replicates with error bars given by the standard deviation.The plot shows that the mean correlation decreases as the sampling rate decreases and that once the bias effect is sufficiently large (around r t1 b ¼ 40%) the mean correlation is lower in the effect case than the baseline case with a difference of approximately 0.1.This demonstrates that it is possible for the bias in multi-time clones to negatively impact fate probability prediction.The results for fate B are identical and hence are omitted.

Figure 5 .
Figure 5. Multi-time information improves the fate probability predictions from LineageOT: Pearson correlation between the true fate probability and the fate probability estimated from LineageOT and LineageOT-MT for fate A at t 2 as a function of t 1 rate (combined sampling and barcoding rate r t1 b) for b ¼ 0.8.Values are summarized as a mean across 10 replicates with error bars given by the standard deviation.The results for fate B are identical and hence are omitted.

Figure 6 .
Figure 6.It is possible for the bias to impact fate probability predictions from LineageOT-MT: Pearson correlation between the true fate probability and the fate probability estimated from LineageOT-MT (the probability each cell is fated to be type A at t 2 ) as a function of t 1 rate (combined sampling and barcode rate r t1 b) for b ¼ 0.8 fixed for the baseline and effect cases.Values are summarized as a mean across 10 replicates with error bars given by the standard deviation.The plot shows that the mean correlation decreases as the sampling rate decreases, and that for sufficiently high sampling rate (greater than approximately r t1 b ¼ 30%) the mean correlation is lower in the effect case than the baseline case, with a difference of approximately 0.1.This demonstrates that it is possible for the bias in multi-time clones to negatively impact fate probability prediction.The results for fate B are identical and hence are omitted.The lack of impact on LineageOT-MT at rates less than 0.3, where the size of the bias effect is larger, may be explained by the methodological "convergence" of LineageOT-MT to LineageOT as the sampling rate decreases.

Figure 7 .
Figure 7.An illustration of label propagation: the left image illustrates a raw partially labeled dataset where the dots are data points colored by their label.The grey dots are the unlabeled points and the background color shows the true distribution of each cluster.The right image shows the same dataset after performing successful label propagation to share labels from the labeled points to the unlabeled points, resulting in a fully labeled dataset.